perf(core): Eliminate stat() syscall and lazy-load encoding libraries in fileRead by yamadashy · Pull Request #1399 · yamadashy/repomix

yamadashy · 2026-04-05T05:33:49Z

Two optimizations to reduce file reading overhead:

1. Remove redundant fs.stat() before fs.readFile()

Previously each file required stat() (size check) then readFile() = 2 syscalls
Now readFile() runs first, then buffer.length is checked = 1 syscall
Files exceeding maxFileSize (default 10MB) are rare; the occasional oversized read is acceptable for halving syscall count on all files

2. Lazy-load jschardet and iconv-lite

These libraries have ~25ms combined import cost at startup
The fast UTF-8 path (covers ~99% of source code files) never needs them
They are only loaded on first encounter of a non-UTF-8 file

Extracted from #1377. Pure performance optimization with no functional changes.

Checklist

Run npm run test
Run npm run lint

… in fileRead Two optimizations to reduce file reading overhead: 1. Remove redundant fs.stat() before fs.readFile() - Previously each file required stat() (size check) then readFile() = 2 syscalls - Now readFile() runs first, then buffer.length is checked = 1 syscall - Files exceeding maxFileSize (default 10MB) are rare; the occasional oversized read is acceptable for halving syscall count on all files 2. Lazy-load jschardet and iconv-lite - These libraries have ~25ms combined import cost at startup - The fast UTF-8 path (covers ~99% of source code files) never needs them - They are only loaded on first encounter of a non-UTF-8 file Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-05T05:34:01Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: feff0237-cb02-46e8-a5d9-bc651de02a8d

📥 Commits

Reviewing files that changed from the base of the PR and between 611b88a and f682fe9.

📒 Files selected for processing (1)

src/core/file/fileRead.ts

📝 Walkthrough

Walkthrough

The PR optimizes file reading in readRawFile by eliminating a redundant fs.stat() syscall (now uses buffer length for size validation), adds early logging, and implements lazy loading of encoding detection dependencies via a cached helper that dynamically imports jschardet and iconv-lite only on the non-UTF-8 slow path.

Changes

Cohort / File(s)	Summary
File Read Optimization `src/core/file/fileRead.ts`	Consolidates `fs.stat()` + conditional logic into single `fs.readFile()` call; replaces `stats.size` check with `buffer.length` for max file size validation; implements lazy loading of `jschardet` and `iconv-lite` via cached `getEncodingDeps()` helper using dynamic `import()`; adds early "Reading file" log statement.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

PR #1007: Modifies encoding detection and decoding logic in fileRead.ts using jschardet and iconv with conditional skip behavior.
PR #1303: Eliminates fs.stat() call in favor of single fs.readFile() with buffer.length for size checks in file read path.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/eliminate-stat-syscall

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-04-05T05:34:02Z

⚡ Performance Benchmark

Latest commit:	`f682fe9` perf(core): Eliminate stat() syscall and lazy-load encoding libraries in fileRead
Status:	✅ Benchmark complete!
Ubuntu:	1.59s (±0.03s) → 1.57s (±0.03s) · -0.02s (-1.0%)
macOS:	1.46s (±0.23s) → 1.44s (±0.15s) · -0.01s (-1.0%)
Windows:	2.03s (±0.02s) → 1.97s (±0.03s) · -0.06s (-2.9%)

Details

Packing the repomix repository with node bin/repomix.cjs
Warmup: 2 runs (discarded), interleaved execution
Measurement: 20 runs / 30 on macOS (median ± IQR)
Workflow run

codecov · 2026-04-05T05:35:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.41%. Comparing base (611b88a) to head (f682fe9).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1399      +/-   ##
==========================================
+ Coverage   87.40%   87.41%   +0.01%     
==========================================
  Files         116      116              
  Lines        4389     4393       +4     
  Branches     1018     1019       +1     
==========================================
+ Hits         3836     3840       +4     
  Misses        553      553

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

gemini-code-assist

Code Review

This pull request optimizes file reading by lazy-loading encoding detection libraries and removing the fs.stat syscall to reduce I/O operations. Feedback indicates a potential race condition and type safety issues in the lazy-loading logic. Additionally, there is a concern that removing the early size check via fs.stat could lead to high memory consumption or out-of-memory errors when processing very large files.

gemini-code-assist · 2026-04-05T05:35:18Z

+let _jschardet: typeof import('jschardet') | undefined;
+let _iconv: typeof import('iconv-lite') | undefined;
+const getEncodingDeps = async () => {
+  if (!_jschardet || !_iconv) {
+    [_jschardet, _iconv] = await Promise.all([import('jschardet'), import('iconv-lite')]);
+  }
+  return { jschardet: _jschardet, iconv: _iconv };
+};


The current implementation of getEncodingDeps has a potential race condition. If multiple concurrent calls to readRawFile trigger the slow path (non-UTF-8 files) before the first import() completes, the modules will be imported multiple times. Additionally, the return type of the function includes undefined for both dependencies, which will cause TypeScript errors or runtime crashes when calling encodingDeps.jschardet.detect() at line 73.

Using a single promise to manage the lazy loading ensures that the modules are only loaded once and provides a clean, non-nullable return type for the caller.

let encodingDepsPromise: Promise<{ jschardet: typeof import('jschardet'); iconv: typeof import('iconv-lite'); }> | undefined; const getEncodingDeps = async () => { if (!encodingDepsPromise) { encodingDepsPromise = Promise.all([ import('jschardet'), import('iconv-lite'), ]).then(([jschardet, iconv]) => ({ jschardet: (jschardet as any).default || jschardet, iconv: (iconv as any).default || iconv, })); } return encodingDepsPromise; };

gemini-code-assist · 2026-04-05T05:35:18Z

+    // Read the file directly and check size afterward, avoiding a separate stat() syscall.
+    // This halves the number of I/O operations per file.
+    // Files exceeding maxFileSize are rare, so the occasional oversized read is acceptable.
+    const buffer = await fs.readFile(filePath);


Removing the fs.stat check before fs.readFile introduces a risk of high memory consumption. While it reduces the number of syscalls, any file exceeding maxFileSize (e.g., a large log file, dataset, or binary not caught by extension checks) will now be fully loaded into memory before the size check occurs. In extreme cases, this could lead to OutOfMemoryError or process crashes if the file is very large (e.g., several GBs).

Consider if the performance gain of one less stat syscall (which is metadata-only and very fast) outweighs the safety of the early size check, especially for a tool that might be run on diverse and potentially large repositories.

claude · 2026-04-05T05:37:29Z

Code Review — `perf(core): Eliminate stat() syscall and lazy-load encoding libraries`

Clean, well-scoped PR with good inline documentation. Two observations worth considering:

1. Memory impact of reading oversized files before rejection

Priority: Recommended

Removing stat() means files are now fully read into memory before the size check. Combined with:

maxFileSize default of 50 MB (not 10 MB as the PR description states)
FILE_COLLECT_CONCURRENCY = 50 parallel workers

The theoretical worst case is 50 × 50 MB = 2.5 GB of heap allocation before any file is discarded — e.g., a repo with many large .sql or .csv files that pass the binary-extension check.

The old stat()-first approach rejected these with near-zero memory cost. For typical repos this is fine, but for the MCP server use case (potentially untrusted repos), it's worth acknowledging.

Possible mitigations (not necessarily needed now)

Add a comment noting the concurrency × maxFileSize memory ceiling
Consider fs.open() + fstat() on the fd (single open, no TOCTOU race, no full read) if this becomes an issue
Or cap concurrent reads when file sizes are unknown

On the positive side: removing the separate stat() call eliminates the TOCTOU race that existed before (file could be swapped between stat and read). That's a security improvement.

2. Lazy-load caching pattern — consider a singleton promise

Priority: Recommended

The current pattern with two separate let variables has a benign race under concurrency: if multiple files fail UTF-8 decoding simultaneously, all concurrent callers see _jschardet === undefined before any await resoletes, triggering redundant Promise.all calls.

A cleaner idiomatic pattern that also fixes the TypeScript type narrowing:

let _encodingDepsPromise: Promise<{ jschardet: typeof import('jschardet'); iconv: typeof import('iconv-lite') }> | undefined;
const getEncodingDeps = () => {
  if (!_encodingDepsPromise) {
    _encodingDepsPromise = Promise.all([import('jschardet'), import('iconv-lite')])
      .then(([jschardet, iconv]) => ({ jschardet, iconv }));
  }
  return _encodingDepsPromise;
};

This is the standard singleton-promise pattern: one cached Promise, no split-state risk, correct under concurrency, and the resolved type is non-optional.

Minor notes

PR description says "default 10MB" — the actual default is 50 MB (configSchema.ts:84). Worth correcting for accuracy.
Test coverage is adequate for the behavioral changes. One gap worth considering: there's no test for a valid non-UTF-8 file decoded successfully (e.g., Latin-1/Shift-JIS). The existing slow-path test only covers the rejection branch, not the success branch — so a getEncodingDeps wiring regression wouldn't be caught.
Leading underscore naming (_jschardet, _iconv) — Airbnb style discourages dangling underscores. Minor nit.

Overall this is a solid performance optimization. The lazy-load of encoding libraries is clearly beneficial (~25ms startup saving, UTF-8 fast path is great). The stat removal is a reasonable tradeoff for typical usage. 👍

🤖 Generated with Claude Code

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

yamadashy closed this Apr 5, 2026

gemini-code-assist bot reviewed Apr 5, 2026

View reviewed changes

devin-ai-integration bot reviewed Apr 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(core): Eliminate stat() syscall and lazy-load encoding libraries in fileRead#1399

perf(core): Eliminate stat() syscall and lazy-load encoding libraries in fileRead#1399
yamadashy wants to merge 1 commit intomainfrom
perf/eliminate-stat-syscall

yamadashy commented Apr 5, 2026 •

edited by devin-ai-integration bot

Loading

Uh oh!

coderabbitai bot commented Apr 5, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Uh oh!

github-actions bot commented Apr 5, 2026 •

edited

Loading

Uh oh!

codecov bot commented Apr 5, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Apr 5, 2026

Uh oh!

gemini-code-assist bot Apr 5, 2026

Uh oh!

claude bot commented Apr 5, 2026

Uh oh!

devin-ai-integration bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

yamadashy commented Apr 5, 2026 • edited by devin-ai-integration bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

Uh oh!

coderabbitai bot commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Uh oh!

github-actions bot commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚡ Performance Benchmark

Uh oh!

codecov bot commented Apr 5, 2026

Codecov Report

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot commented Apr 5, 2026

Code Review — perf(core): Eliminate stat() syscall and lazy-load encoding libraries

1. Memory impact of reading oversized files before rejection

2. Lazy-load caching pattern — consider a singleton promise

Minor notes

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yamadashy commented Apr 5, 2026 •

edited by devin-ai-integration bot

Loading

coderabbitai bot commented Apr 5, 2026 •

edited

Loading

github-actions bot commented Apr 5, 2026 •

edited

Loading

Code Review — `perf(core): Eliminate stat() syscall and lazy-load encoding libraries`